feat: i686 page tables, snapshot compaction, and CoW (standalone) #1385

Open
danbugs wants to merge 6 commits into main from nanvix-platform

Contributor

@danbugs danbugs commented Apr 16, 2026

Summary

The cleaned-up version of #1381: drops the dependencies on #1379 and #1380 so this PR can land on its own.

Commits

  • refactor: replace nanvix-unstable with i686-guest and guest-counter features
  • feat: i686 protected-mode boot and unified restore path
  • feat: i686 page tables, snapshot compaction, and CoW support
  • fixup: address PR #1381 review comments

What this adds

  • i686 guest support on the x86_64 host: 32-bit protected-mode boot, unified restore path, 2-level page-table walking and snapshot compaction with CoW-resolved PTE reads.
  • Feature rename — the prior nanvix-unstable gates split into i686-guest (PT walker, protected-mode boot, compaction) and guest-counter (scratch counter plumbing). No behavior change for consumers using the old feature flag; Cargo.toml / Justfile / build.rs updated to match.

What it does not include (vs #1381)

The changes from #1379 and #1380. Both can land separately; #1381 happened to stack on top of them.

Review items from #1381 addressed here

  • PTE decode uses u32::from_le_bytes (previously from_ne_bytes) — PTEs are little-endian by arch spec.
  • i686-guest feature guarded with a compile_error! on targets that are neither x86 (guest) nor x86_64 (host).
  • restore_snapshot rewrites the scratch PD-roots bookkeeping (SCRATCH_TOP_PD_ROOTS_{COUNT,ARRAY}_OFFSET) so a subsequent snapshot() doesn't see a stale zero count. Snapshot persists the root count; root GPAs are deterministic (layout.get_pt_base_gpa() + i * PAGE_SIZE).
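A minimal sketch of why persisting only the root count is enough: each root's GPA can be recomputed from the page-table base. The function name `pd_root_gpas` and the `pt_base_gpa` parameter (standing in for `layout.get_pt_base_gpa()`) are illustrative assumptions, not the PR's actual code.

```rust
/// Size of one i686 page, which is also the size of one page directory.
const PAGE_SIZE: u64 = 0x1000;

/// Recompute the GPA of every compacted PD root from the persisted count.
/// `pt_base_gpa` is a stand-in for `layout.get_pt_base_gpa()` (assumption).
fn pd_root_gpas(pt_base_gpa: u64, n_pd_roots: usize) -> Vec<u64> {
    (0..n_pd_roots as u64)
        .map(|i| pt_base_gpa + i * PAGE_SIZE)
        .collect()
}
```

On restore, the host would write the count and this recomputed array back into the scratch bookkeeping area before any later `snapshot()` reads it.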

Verification

Built with cargo check -p hyperlight-{common,host,guest} --features kvm,mshv3,executable_heap,i686-guest,guest-counter — clean. End-to-end tested against nanvix@e61306676: Hello 5/5 with NANVIX_REPEAT=4 through the restore+call loop.

ludfjig and others added 4 commits April 16, 2026 20:09
…eatures

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Co-authored-by: danbugs <danilochiarlone@gmail.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Co-authored-by: danbugs <danilochiarlone@gmail.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Co-authored-by: danbugs <danilochiarlone@gmail.com>
- sandbox/snapshot.rs: decode i686 PTEs with `u32::from_le_bytes`
  instead of `from_ne_bytes`. Page-table entries are defined as
  little-endian by arch spec; `from_ne_bytes` is incorrect on
  big-endian hosts and inconsistent with surrounding helpers that
  already use `from_le_bytes`.

- hyperlight_common/vmem.rs: guard the `i686-guest` feature with a
  `compile_error!` on targets that are neither `x86` (the guest
  itself) nor `x86_64` (the host that runs the guest). Previously
  enabling the feature on e.g. aarch64 would silently compile but
  downstream crates would hit confusing missing-item errors when
  they reached for `vmem::i686_guest`.

- mem/mgr.rs, sandbox/snapshot.rs: persist the per-process
  PD-roots count in `Snapshot` and rewrite the scratch bookkeeping
  area (`SCRATCH_TOP_PD_ROOTS_{COUNT,ARRAY}_OFFSET`) during
  `restore_snapshot`. Scratch is zeroed on restore, so without
  this a subsequent `snapshot()` call would read count=0 through
  `read_pd_roots_from_scratch` and fail. Root `i`'s compacted
  GPA is deterministic (`layout.get_pt_base_gpa() + i * PAGE_SIZE`
  — same layout `compact_i686_snapshot` used when building the
  rebuilt PDs), so we only need to store the count.

Signed-off-by: danbugs <danilochiarlone@gmail.com>
@danbugs danbugs added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Apr 16, 2026
Member

@simongdavies simongdavies left a comment

Reviewers: Opus, Gemini, ChatGPT

All three reviewers agree this is a well-structured PR that cleanly separates the nanvix-unstable feature into i686-guest and guest-counter, and replaces the TODO-laden real-mode boot with proper 32-bit protected mode + paging. The code quality is generally good — comments are thorough, unsafe blocks are contained, and the architecture is sensible.

However, all three independently flagged critical concerns around the CoW flag logic, potential scratch memory layout overlap, and the complete absence of tests for the i686 code path.

🟢 Good Stuff

  • Unified restore_all_state(): Removing the TODO-laden nanvix-unstable branch with its "this is probably not correct" comment is a significant improvement. The CR3-based restore is now clean and shared.
  • compile_error! guard for i686-guest on non-x86 — nice defensive programming.
  • OOB-safe read_entry implementations: Both Snapshot and SharedMemoryPageTableBuffer return 0 (not-present) for OOB addresses rather than panicking. Exactly right for defensively walking guest-controlled page tables.
  • Clean feature flag split: i686-guest + guest-counter are genuinely independent concerns.
  • Thorough doc comments on Snapshot fields (n_pd_roots, separate_pt_bytes).
  • Scratch bookkeeping factoring: Extracting copy_pt_to_scratch() from update_scratch_bookkeeping() is a clean refactor.
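The OOB-safe read pattern praised above can be sketched roughly as follows. This is an illustration over a plain byte slice, not the actual `Snapshot` or `SharedMemoryPageTableBuffer` implementation; the name `read_entry_le` is hypothetical.

```rust
/// Read the 4-byte i686 page-table entry at byte `offset` in `table`.
/// Returns 0 (a not-present entry) when the read would run past the buffer,
/// so a walker fed guest-controlled addresses never panics.
fn read_entry_le(table: &[u8], offset: usize) -> u32 {
    offset
        .checked_add(4)
        .and_then(|end| table.get(offset..end))
        // Entries are little-endian regardless of the host's byte order.
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .unwrap_or(0)
}
```

Because bit 0 (present) is clear in the returned value, callers that already skip not-present entries need no separate error path.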

🔴 Additional findings (on lines outside the diff)

Snapshot impl of TableReadOps always reads 8-byte entries (snapshot.rs line 131, Flagged by: 2/3)

The impl TableReadOps for Snapshot at line 131 of snapshot.rs unconditionally reads 8 bytes for a PTE. While it appears to only be used in the #[cfg(not(feature = "i686-guest"))] path currently, it's not gated by #[cfg]. If anyone ever calls virt_to_phys with a Snapshot on the i686 path, it will read 8-byte entries from a 4-byte PTE table — reading cross-entry data and producing garbage mappings. Either gate with #[cfg(not(feature = "i686-guest"))] or add a comment + compile_error! guard.

Endianness inconsistency (snapshot.rs line 149, Flagged by: 2/3)

Line 149 of snapshot.rs uses u64::from_ne_bytes() (native endian), while the i686 SharedMemoryPageTableBuffer::read_entry() at line 273 correctly uses u32::from_le_bytes() with a comment noting "Page-table entries are little-endian by arch spec". The x86_64 path happens to work on LE hosts but is technically incorrect by the same reasoning. Should use from_le_bytes consistently.

See inline comments for the remaining findings.

};
let mut flags = (pte & 0xFFF) | extra_flags;
// Mark writable or already-CoW pages as CoW (read-only + AVL bit).
if (flags & RW_FLAGS & !PTE_COW) != 0 || (flags & PTE_COW) != 0 {
Member

@simongdavies simongdavies Apr 16, 2026

🔴 Critical: Overly broad CoW flag condition (Flagged by: 3/3)

RW_FLAGS = PTE_PRESENT | PTE_RW | PTE_ACCESSED = 0x23. The expression flags & RW_FLAGS & !PTE_COW simplifies to just flags & RW_FLAGS because PTE_COW (bit 9) doesn't overlap with RW_FLAGS. So (flags & 0x23) != 0 is true for any present page (since PRESENT is bit 0 and every present PTE has it set).

This means every present page — including genuinely read-only .text/.rodata — gets marked CoW. Read-only pages don't need CoW; they can be shared directly.

Suggested fix — check for PAGE_RW specifically:

if (flags & PAGE_RW as u32) != 0 || (flags & PTE_COW) != 0 {
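To make the difference concrete, here is a small sketch comparing the two conditions. The constant values follow the bit positions quoted in this comment; `PTE_RW` is assumed to be the same bit the suggested fix calls `PAGE_RW`, and the function names are hypothetical.

```rust
const PTE_PRESENT: u32 = 1 << 0;
const PTE_RW: u32 = 1 << 1; // assumed equivalent to PAGE_RW in the suggested fix
const PTE_ACCESSED: u32 = 1 << 5;
const PTE_COW: u32 = 1 << 9; // AVL bit used as the CoW marker
const RW_FLAGS: u32 = PTE_PRESENT | PTE_RW | PTE_ACCESSED;

/// The condition as written in the diff: true for any present page.
fn original_marks_cow(flags: u32) -> bool {
    (flags & RW_FLAGS & !PTE_COW) != 0 || (flags & PTE_COW) != 0
}

/// The suggested condition: true only for writable or already-CoW pages.
fn suggested_marks_cow(flags: u32) -> bool {
    (flags & PTE_RW) != 0 || (flags & PTE_COW) != 0
}
```

A present, accessed, read-only page (flags `0x21`) satisfies the original condition but not the suggested one, which is the behaviour the review asks for.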

/// The guest writes this before signaling boot-complete so the host can walk
/// all active PDs during snapshot creation (not just CR3).
#[cfg(feature = "i686-guest")]
pub const SCRATCH_TOP_PD_ROOTS_COUNT_OFFSET: u64 = 0x28;
Member

@simongdavies simongdavies Apr 16, 2026

🔴 Critical: PD roots bookkeeping may overlap exception stack (Flagged by: 2/3)

SCRATCH_TOP_EXN_STACK_OFFSET = 0x20 means the exception stack top lives here and grows downward. PD_ROOTS_COUNT_OFFSET = 0x28 is only 8 bytes below the exception stack top.

If any i686 guest ever uses exceptions that push state to this stack, the first push would land at offset 0x28 — exactly on top of PD_ROOTS_COUNT. This would corrupt the PD roots count, causing the host to read garbage PD root GPAs.

Suggested mitigations:

  • Move PD roots to a dedicated area well below the exception stack (e.g. offset 0x200+)
  • Or document the invariant explicitly with a compile-time check that the exception stack size is bounded
  • Or reserve explicit space between the exception stack and PD roots bookkeeping
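One way the compile-time check could look is sketched below. It assumes offsets are measured downward from the top of scratch and that a field occupies the bytes starting at `top - offset`. `EXN_STACK_RESERVED`, `RELOCATED_PD_ROOTS_COUNT_OFFSET`, and `overlaps_exn_stack` are hypothetical names and values chosen for illustration only.

```rust
const SCRATCH_TOP_EXN_STACK_OFFSET: u64 = 0x20;
const SCRATCH_TOP_PD_ROOTS_COUNT_OFFSET: u64 = 0x28;

/// Hypothetical bound on how far the exception stack may grow downward.
const EXN_STACK_RESERVED: u64 = 0x1000;
/// Hypothetical relocated offset, placed just below the reserved stack region.
const RELOCATED_PD_ROOTS_COUNT_OFFSET: u64 =
    SCRATCH_TOP_EXN_STACK_OFFSET + EXN_STACK_RESERVED + 8;

/// True if a field of `field_size` bytes at `field_offset` (from scratch top)
/// intersects the region the downward-growing exception stack may use.
const fn overlaps_exn_stack(field_offset: u64, field_size: u64, reserved: u64) -> bool {
    field_offset > SCRATCH_TOP_EXN_STACK_OFFSET
        && field_offset < SCRATCH_TOP_EXN_STACK_OFFSET + reserved + field_size
}

// Fails the build if the relocated field ever drifts into the stack region.
const _: () = assert!(!overlaps_exn_stack(
    RELOCATED_PD_ROOTS_COUNT_OFFSET,
    8,
    EXN_STACK_RESERVED
));
```

Under these assumptions the current offset `0x28` is reported as overlapping, while the relocated offset is not.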

}

#[cfg(test)]
#[cfg(not(feature = "i686-guest"))]
Member

@simongdavies simongdavies Apr 16, 2026

🔴 Critical: Zero test coverage on i686 code path (Flagged by: 3/3)

All existing tests are #[cfg(not(feature = "i686-guest"))]. This PR adds ~500 lines of complex page table manipulation, CoW map building, and snapshot compaction with zero test coverage. This is security-critical VMM code.

Functions without tests:

  • build_cow_map() — walks guest-controlled page tables
  • compact_i686_snapshot() — 100+ lines of compaction logic
  • build_initial_i686_page_tables() — initial PT setup
  • i686_pt::Builder — all methods
  • virt_to_phys_all() — the i686 PT walker
  • read_pd_roots_from_scratch() — reads guest-controlled data

At minimum, unit tests should cover: the i686_pt::Builder, the CoW map builder with crafted PTs, compaction with multiple PD roots, and adversarial inputs (e.g. PD entries pointing out of bounds).

Member

I've added tests.

let mut cow_map = std::collections::HashMap::new();
let scratch_base = scratch_base_gpa(layout.get_scratch_size());
let scratch_end = scratch_base + layout.get_scratch_size() as u64;
let mem_size = layout.get_memory_size().unwrap_or(0) as u64;
Member

@simongdavies simongdavies Apr 16, 2026

🟡 Warning: Silent fallback on error (Flagged by: 2/3)

get_memory_size().unwrap_or(0) silently uses 0 on error, which means the va < mem_size check will reject ALL mappings and the CoW map will be empty. This could cause incorrect snapshot behavior (no CoW pages detected). The error should be propagated, or at minimum logged.
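A rough sketch of propagating the error instead of falling back to 0. The `Layout` type, its `String` error, and `cow_candidates` are simplified stand-ins for illustration; the real layout type and error enum will differ.

```rust
/// Hypothetical stand-in for the sandbox memory layout.
struct Layout {
    memory_size: Option<usize>,
}

impl Layout {
    fn get_memory_size(&self) -> Result<usize, String> {
        self.memory_size
            .ok_or_else(|| "memory size unavailable".to_string())
    }
}

/// Keep only VAs below the memory size. A layout error aborts the whole
/// operation rather than silently producing an empty result.
fn cow_candidates(layout: &Layout, vas: &[u64]) -> Result<Vec<u64>, String> {
    let mem_size = layout.get_memory_size()? as u64;
    Ok(vas.iter().copied().filter(|&va| va < mem_size).collect())
}
```

With `unwrap_or(0)` the failing case would have returned an empty list; here the caller sees the error and can decide how to handle it.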

Member

This should be resolved now.

};

// Resolve a VA through a PD to its physical frame.
let resolve_through_pd = |pd_gpa: u64, va: u64| -> u64 {
Member

@simongdavies simongdavies Apr 16, 2026

🟡 Warning: Fallback returns VA instead of PA (Flagged by: 1/3)

Returning va (a virtual address) when the PD walk fails mixes virtual and physical addresses. The caller uses this result to get a physical page table GPA. If the PD doesn't have an entry for a user PT, this returns the un-translated address, which could cause the host to read from an incorrect location when rebuilding user page tables.

Consider returning Option<u64> and skipping the PT rebuild when resolution fails.
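A sketch of what an `Option`-returning resolver might look like for a 2-level i686 walk. It models guest memory as a flat byte slice indexed by GPA, ignores 4 MiB large pages, and returns the frame address plus the page offset; all of these are simplifying assumptions, and `read_u32_le` is a hypothetical helper.

```rust
const PTE_PRESENT: u32 = 1;
const ADDR_MASK: u32 = 0xFFFF_F000;

/// Read a little-endian 4-byte entry at `gpa`, or None if out of bounds.
fn read_u32_le(mem: &[u8], gpa: u64) -> Option<u32> {
    let start = gpa as usize;
    let b = mem.get(start..start.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Resolve `va` through the page directory at `pd_gpa`.
/// Returns None when any level is missing, instead of echoing back `va`.
fn resolve_through_pd(mem: &[u8], pd_gpa: u64, va: u64) -> Option<u64> {
    let pde = read_u32_le(mem, pd_gpa + ((va >> 22) & 0x3FF) * 4)?;
    if (pde & PTE_PRESENT) == 0 {
        return None;
    }
    let pt_gpa = (pde & ADDR_MASK) as u64;
    let pte = read_u32_le(mem, pt_gpa + ((va >> 12) & 0x3FF) * 4)?;
    if (pte & PTE_PRESENT) == 0 {
        return None;
    }
    Some((pte & ADDR_MASK) as u64 | (va & 0xFFF))
}
```

The caller can then skip the PT rebuild on `None` rather than reading from an untranslated address.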

fs: data_seg,
gs: data_seg,
tr: tr_seg,
cr0: 0x80010011, // PE + ET + WP (write-protect for CoW) + PG
Member

@simongdavies simongdavies Apr 16, 2026

🟡 Warning: Magic number (Flagged by: 2/3)

The x86_64 path uses named constants (CR4_PAE, CR4_OSFXSR, etc.). The i686 path should define CR0_PE, CR0_ET, CR0_WP, CR0_PG constants similarly for self-documenting code:

const CR0_PE: u64 = 1;
const CR0_ET: u64 = 1 << 4;
const CR0_WP: u64 = 1 << 16;
const CR0_PG: u64 = 1 << 31;
cr0: CR0_PE | CR0_ET | CR0_WP | CR0_PG,

Comment thread src/hyperlight_host/build.rs Outdated
// memories, and so can't share
unshared_snapshot_mem: { any(feature = "nanvix-unstable", feature = "gdb") },
// gdb needs writable snapshot memory for debug access.
unshared_snapshot_mem: { feature = "gdb" },
Member

@simongdavies simongdavies Apr 16, 2026

🟡 Warning: unshared_snapshot_mem no longer gated on i686 (Flagged by: 1/3)

Old code: unshared_snapshot_mem: { any(feature = "nanvix-unstable", feature = "gdb") }. Now only gated on gdb. The separate_pt_bytes approach may avoid this need, but this should be explicitly validated — if any i686 code path writes to snapshot memory, it will silently fail with a read-only mapping.

Member

This should be resolved now.

@andreiltd andreiltd force-pushed the nanvix-platform branch 2 times, most recently from cd065b1 to 105974d on April 17, 2026 12:05
Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>
Signed-off-by: Tomasz Andrzejak <andreiltd@gmail.com>
Member

@syntactically syntactically left a comment

The high-level feature here (supporting x86-32 VA spaces instead of the pretending-to-be-real-mode #[cfg(feature = "nanvix-unstable")]) is great.

It seems like all of the i686 page table code is scattered around in various places and does not follow the same structure or architecture-independent API as the amd64 and aarch64 code. I don't think that's at all maintainable. Please try to use the same structures and APIs as the amd64 code, having i686 implementations in hyperlight_common/src/arch/i686/vmem.rs that implement the architecture-independent re-exports of hyperlight_common/src/vmem.rs and allowing most/all of the downstream code to use the same APIs and patterns. (It may be sensible to extract out the page table iterator code that is currently amd64 only, in which case I would expect it to not require much extra code to support i686 page tables (although, if you do this, please check that inlining the iterators still works, at least in release builds---e.g. vmem::map should inline down to basically the same thing as a handwritten loop in a single function)).

I would expect the changes to src/hyperlight_host/src/sandbox/snapshot.rs in particular to be limited to basically

  • Make Snapshot::new iterate through several different VA spaces worth of mappings, instead of just one (there ought to be no need to do anything special to avoid duplicated backing pages as long as the existing phys_seen map is kept)
  • (If necessary) minor changes to support keeping the PTs generated for the new mappings in a separate vec in the snapshot, instead of at the end of the snapshot memory where they are now on all architectures. Ideally, rather than being #[cfg] items directly in the crate, this would be controlled by some constant e.g. hyperlight_common::vmem::PA_SPACE_IS_SMALL: bool (with the naming reflecting the actual motivation for making this different across architectures).
    • It occurs to me that this may not even need any changes here---perhaps it could be achieved simply by changing the mem mgr or hypervisor manager code to avoid mapping the end of the snapshot region into the guest PA space on i686.
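The multi-address-space iteration described above could be sketched like this. It assumes `phys_seen` maps each original backing page to its slot in the compacted snapshot, which is a guess at the existing map's role; `Mapping` and `compact` are hypothetical names, not the crate's types.

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 0x1000;

/// Hypothetical mapping record: one virtual page backed by one physical page.
struct Mapping {
    virt: u64,
    phys: u64,
}

/// Visit every address space and give each distinct backing page one slot
/// in the compacted snapshot. Sharing `phys_seen` across spaces means a page
/// mapped by several address spaces is copied only once.
fn compact(spaces: &[Vec<Mapping>]) -> (HashMap<u64, u64>, Vec<Vec<Mapping>>) {
    let mut phys_seen: HashMap<u64, u64> = HashMap::new();
    let mut out = Vec::with_capacity(spaces.len());
    for space in spaces {
        let mut remapped = Vec::with_capacity(space.len());
        for m in space {
            // Next free slot, used only if this backing page is new.
            let next = phys_seen.len() as u64 * PAGE_SIZE;
            let new_phys = *phys_seen.entry(m.phys).or_insert(next);
            remapped.push(Mapping { virt: m.virt, phys: new_phys });
        }
        out.push(remapped);
    }
    (phys_seen, out)
}
```

Nothing in the loop is specific to the number of address spaces, which is the point of the suggestion: the single-space case is just a one-element slice.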

I have left a few other minor inline comments, but the above issue is the only thing that I think is actually quite important.

I didn't review any of the i686 address space manipulation code yet, because it seems like it will need to be refactored and generalised a little to be able to provide the same APIs that we use on other architectures. I think it is quite important that all the guest architectures implement the same APIs, so that most of the code can be architecture-independent.

/// page tables in guest memory. The `arch/i686/vmem.rs` module only compiles
/// for `target_arch = "x86"` (the guest side), so the host-side walker lives
/// here, gated behind the feature flag.
#[cfg(feature = "i686-guest")]
Member

I think it would make more sense to have the i686/vmem.rs module used when feature = i686-guest---we should understand that the arch in hyperlight_common is always about the guest arch.

///
/// # Safety
/// The caller must ensure that `op` provides valid page table memory.
pub unsafe fn virt_to_phys_all<Op: TableReadOps>(op: &Op) -> Vec<Mapping> {
Member

Is there a reason this doesn't match the API of the amd64 stuff so that it can be re-exported from the architecture-independent module, like we do on other platforms?

ExeInfo::Elf(elf) => Offset::from(elf.entrypoint_va()),
}
}
/// Returns the base virtual address of the loaded binary (lowest PT_LOAD p_vaddr).
Member

This whole base va sliding is a hack because of some executables we used to have + the lack of virtual memory in the guest. I don't think that any of the usual guests end up computing anything other than 0 here, so I was thinking it was worth getting rid of entirely. Do the nanvix binaries actually end up with something here? If so, can we replace the unpleasant semantics-breaking relocation with just changing the initial page tables slightly to get things mapped to the VAs that they ask for?

(Not relevant for merging this PR---just curious about answers to these questions for the future).

type T<S: SharedMemory>;
}
pub struct SnapshotSharedMemory_;
impl SnapshotSharedMemoryT for SnapshotSharedMemory_ {
Member

I would have expected that with the description of this PR, we could get rid of this and use ReadonlySharedMemory unconditionally, since Nanvix will no longer depend on writing to the snapshot shared memory. Is that not true?

abort_buffer: Vec::new(), // Guest doesn't need abort buffer
};
host_mgr.update_scratch_bookkeeping()?;
host_mgr.copy_pt_to_scratch()?;
Member

Does this need to make the same differentiation between the i686 and x86-64 sources of the initial page table information?

/// the i686-guest path so the scratch state mirrors what it
/// looked like right after the snapshot was taken.
#[cfg(feature = "i686-guest")]
fn update_pd_roots_bookkeeping(&mut self, n_roots: usize) -> Result<()> {
Member

I am still not a huge fan of this extra table.

If the roots are only updated in this table, how does kernel code notice that a snapshot+restore cycle has happened and update its internal source-of-truth for them? (i.e. I assume there is another copy of these hanging out in various struct task-like structures that need to be updated). If the location of those internal-source-of-truth copies of things could be provided by a hypercall or a root finder callback, it would perhaps simplify things, as well as removing this statically-allocated limit to the number of distinct address spaces in the guest.

.vm
.get_root_pt()
.map_err(|e| HyperlightError::HyperlightVmError(e.into()))?;
.map_err(|e| HyperlightError::HyperlightVmError(e.into()))?];
Member

If this were a hypercall or a callback into the embedder, it would also avoid this architecture-specific dependency.

/// CoW resolution map: maps snapshot GPAs to their CoW'd scratch GPAs.
/// Built by walking the kernel PD to find pages that were CoW'd during boot.
#[cfg(feature = "i686-guest")]
cow_map: Option<&'a std::collections::HashMap<u64, u64>>,
Member

Why is this information necessary beyond the scope of one address space traversal? And, why is this a useful form of this information?

u64::from_ne_bytes(n)
let memoff = access_gpa(self.snap, self.scratch, self.layout, addr);
// For i686 guests, page table entries are 4 bytes; for x86_64 they
// are 8 bytes. Read the correct size based on the feature flag.
Member

This can just use mem::size_of::<hyperlight_common::vmem::PageTableEntry>(). Or, if it can't because of some limitation on the const-ness of that expression, it should just be another constant and/or type synonym defined in the same place.

There should ideally be little-to-no arch-specific code in this module: the hyperlight_common::vmem and related APIs should abstract over the architecture-of-the-guest differences well enough that this code can be more-or-less entirely generic. (If there are problems with the vmem api assuming too much, we should fix them, rather than introduce more #[cfg] here).
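As a rough illustration of the `size_of`-driven approach, the sketch below reads an entry whose width comes from a type alias rather than a `#[cfg]` branch. The local `PageTableEntry` alias is a hypothetical stand-in for `hyperlight_common::vmem::PageTableEntry` (here `u32`, as for an i686 guest), and `read_pte` is an illustrative name.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for `hyperlight_common::vmem::PageTableEntry`.
/// An i686 guest would use `u32`; an x86_64 guest would use `u64`.
type PageTableEntry = u32;

/// Read one page-table entry at byte `offset`, zero-extended to u64.
/// The entry width is derived from the type, so this body is arch-neutral.
fn read_pte(table: &[u8], offset: usize) -> u64 {
    const N: usize = size_of::<PageTableEntry>();
    let mut buf = [0u8; 8];
    match offset.checked_add(N).and_then(|end| table.get(offset..end)) {
        Some(bytes) => {
            // Little-endian: the low N bytes carry the value, the rest stay 0.
            buf[..N].copy_from_slice(bytes);
            u64::from_le_bytes(buf)
        }
        // Out-of-bounds reads yield a not-present entry.
        None => 0,
    }
}
```

Switching the alias to `u64` would make the same function read 8-byte entries without touching its body.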

/// The buffer stores one or more page directories (PDs) at the front,
/// followed by page tables (PTs) that are allocated on demand. All
/// entries use 4-byte i686 PTEs.
#[cfg(feature = "i686-guest")]
Member

Again, why isn't this just hyperlight_common/arch/i686/vmem.rs and following the same API as the amd64 (and soon to be aarch64) code?
